On the fly generation of multimedia code for image processing

ABSTRACT

A method and apparatus for processing multimedia instruction enhanced data by the use of an abstract routine generator and a translator. The abstract routine generator takes the multimedia instruction enhanced data and generates abstract routines to compile the multimedia instruction enhanced data. The output of the abstract routine generator is an abstract representation of the multimedia instruction enhanced data. The translator then takes the abstract representation and produces code for processing.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 09/614,635 filed Jul. 12, 2000, which is incorporated herein in its entirety by this reference thereto.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates to the processing of multimedia data with processors that feature multimedia instruction enhanced instruction sets. More particularly, the invention relates to a method and apparatus for generating processor instruction sequences for image processing routines that use multimedia enhanced instructions.

2. Description of the Prior Art

In general, most programs that use image processing routines with multimedia instructions do not use a general-purpose compiler for these parts of the program. These programs typically use assembly routines to process such data. A resulting problem is that the assembly routines must be added to the code manually. This step requires high technical skill, is time-consuming, and is prone to introducing errors into the code.

In addition, different types of processors (for example, Intel's Pentium I with MMX, Pentium II, Pentium III, and Willamette, and AMD's K-6 and K-7, also known as Athlon) each use different multimedia command sets. Examples of different multimedia command sets are MMX, SSE, and 3DNow. Applications that use these multimedia command sets must have separate assembly routines that are specifically written for each processor type.

At runtime, the applications select the proper assembly routines based on the processor detected. To reduce the workload and increase the robustness of the code, these assembly routines are sometimes generated by a routine-specific source code generator during program development.
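The conventional scheme described above can be pictured with a minimal C++ sketch. The names CpuKind, AverageFn, and selectAverage are hypothetical and do not come from the application, and the per-processor routines are plain C++ stand-ins for what would be hand-written assembly. The sketch only illustrates that every variant ships inside the program although a single one is selected per run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical processor families. A real program would detect the
// processor at startup; detection is outside the scope of this sketch.
enum class CpuKind { Generic = 0, MMX = 1, SSE = 2, Count = 3 };

using AverageFn = void (*)(const std::uint8_t* a, const std::uint8_t* b,
                           std::uint8_t* out, std::size_t n);

// Stand-in for a hand-written routine: rounding average of two pixel rows.
void averageGeneric(const std::uint8_t* a, const std::uint8_t* b,
                    std::uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>((a[i] + b[i] + 1) >> 1);
}

// In the prior-art scheme these would be separate assembly routines, one per
// command set. Here they simply forward to the generic code.
void averageMMX(const std::uint8_t* a, const std::uint8_t* b,
                std::uint8_t* out, std::size_t n) {
    averageGeneric(a, b, out, n);
}

void averageSSE(const std::uint8_t* a, const std::uint8_t* b,
                std::uint8_t* out, std::size_t n) {
    averageGeneric(a, b, out, n);
}

// All variants are linked into the application; one is picked at runtime.
AverageFn selectAverage(CpuKind cpu) {
    static const AverageFn table[] = { averageGeneric, averageMMX, averageSSE };
    assert(cpu != CpuKind::Count);
    return table[static_cast<int>(cpu)];
}
```

A caller would invoke `selectAverage(detectedCpu)` once and keep the returned pointer, which is why the unused variants remain dead weight in the binary.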

One problem with this type of programming is that the applications must have redundant assembly routines which can process the same multimedia data, but which are written for the different types of processors. However, only one assembly routine is actually used at runtime. Because there are many generations of processors in existence, the size of applications that use multimedia instructions must grow to be compatible with all of these processors. In addition, as new processors are developed, all new routines must be coded for these applications so that they are compatible with the new processors. An application that is released prior to the release of a processor is incompatible with the processor unless it is first patched/rebuilt with the new assembly routines.

It would be desirable to provide programs that use multimedia instructions which are smaller in size. It would also be desirable to provide an approach that adapts such programs to future processors more easily.

SUMMARY OF THE INVENTION

In accordance with the invention, a method and apparatus for generating assembly routines for multimedia instruction enhanced data is shown and described.

One example of multimedia data that can be processed by multimedia instructions is the pixel blocks used in image processing. Most image processing routines operate on rectangular blocks of evenly sized data pieces (e.g. 16×16 pixel blocks of 8-bit video during MPEG motion compensation). The image processing code is described as a set of source blocks, destination blocks, and data manipulations. Each block has a start address, a pitch (the distance in bytes between two consecutive lines), and a data format. The full processing code includes width and height as additional parameters. All of these parameters can be either integer constants or arguments to the generated routine. All data operations are described on SIMD data types. A SIMD data type is a basic data type (e.g. signed byte, signed word, or unsigned byte) and a number of repeats (e.g. 16 pixels for MPEG macroblocks). The size of a block (source or destination) is always the size of its SIMD data type times its width in the horizontal direction and its height in the vertical direction.
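The block parameters listed above can be sketched as plain C++ data structures. All type and function names below (BaseType, SimdType, Block, Extent, blockPayloadBytes, lineAddress) are hypothetical illustrations and are not the classes used in Tables A and B. The sketch assumes that width is counted in SIMD elements and height in lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Basic data types named in the text: signed byte, unsigned byte, signed word.
enum class BaseType { S8, U8, S16 };

std::size_t baseTypeSize(BaseType t) {
    return t == BaseType::S16 ? 2 : 1;
}

// A SIMD data type is a basic data type plus a number of repeats.
struct SimdType {
    BaseType    base;
    std::size_t repeats;
};

std::size_t simdTypeSize(const SimdType& t) {
    return baseTypeSize(t.base) * t.repeats;
}

// A source or destination block: start address, pitch, and data format.
struct Block {
    const std::uint8_t* start;   // start address
    std::ptrdiff_t      pitch;   // bytes between two consecutive lines
    SimdType            format;  // data format
};

// Width (in SIMD elements) and height (in lines) of the processed area.
struct Extent {
    std::size_t width;
    std::size_t height;
};

// Size of a block: size of its SIMD type times width times height.
std::size_t blockPayloadBytes(const Block& b, const Extent& e) {
    return simdTypeSize(b.format) * e.width * e.height;
}

// Address of the first byte of a given line, using the pitch.
const std::uint8_t* lineAddress(const Block& b, std::size_t line) {
    return b.start + static_cast<std::ptrdiff_t>(line) * b.pitch;
}
```

For a 16×16 block of 8-bit pixels, the SIMD type is sixteen unsigned bytes, the width is one element, and the height is sixteen lines, which gives 256 bytes of pixel data regardless of the pitch of the surrounding frame.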

In the presently preferred embodiment of the invention, an abstract routine generator inside the application program produces an abstract routine representation of the code that operates on the multimedia data using SIMD operations. A directed acyclic graph is a typical example of such a generic representation. A translator then generates processor-specific assembly code from the abstract representation.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computer system that may be used to implement a method and apparatus embodying the invention for translating a multimedia routine from its abstract representation generated by an abstract routine generator inside the application's startup code into executable code using the code generator.

DESCRIPTION OF THE PREFERRED EMBODIMENT

In FIG. 1, the startup code 11 of the application program 13, further referred to as the abstract routine generator, generates an abstract representation 15 of the multimedia routine, represented by a data flow graph. This graph is then translated by the code generator 17 into a machine-specific sequence of instructions 19, typically including several SIMD multimedia instructions. The types of operations that can be present inside the data flow graph include add, sub, multiply, average, maximum, minimum, compare, and, or, xor, pack, unpack, and merge operations. This list is not exhaustive; there are operations currently performed by MMX, SSE, and 3DNow, for example, which are not listed. If a specific command set does not support one of these operations, the CPU-specific part of the code generator replaces it with a sequence of simpler instructions (e.g. the maximum instruction can be replaced by a pair of subtract and add instructions using saturation arithmetic).
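The replacement of a maximum operation by a subtract and an add with saturation can be checked with a small scalar model. The sketch below assumes unsigned 8-bit lanes and models one lane at a time; the function names are hypothetical. It relies on the identity max(a, b) = sat_sub(a, b) + b, which holds because the saturating subtraction yields a − b when a > b and zero otherwise.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Unsigned saturating subtract on one byte lane: clamps at 0 instead of wrapping.
std::uint8_t subSaturate(std::uint8_t a, std::uint8_t b) {
    return a > b ? static_cast<std::uint8_t>(a - b) : static_cast<std::uint8_t>(0);
}

// Unsigned saturating add on one byte lane: clamps at 255 instead of wrapping.
std::uint8_t addSaturate(std::uint8_t a, std::uint8_t b) {
    unsigned sum = static_cast<unsigned>(a) + static_cast<unsigned>(b);
    return sum > 255u ? static_cast<std::uint8_t>(255) : static_cast<std::uint8_t>(sum);
}

// Maximum expressed as a subtract/add pair, as a code generator could emit
// when the target command set has no native maximum instruction.
std::uint8_t maxViaSaturation(std::uint8_t a, std::uint8_t b) {
    return addSaturate(subSaturate(a, b), b);
}

// Exhaustively compares the lowered form against std::max for all byte pairs.
bool loweringMatchesMax() {
    for (int a = 0; a < 256; ++a) {
        for (int b = 0; b < 256; ++b) {
            std::uint8_t ua = static_cast<std::uint8_t>(a);
            std::uint8_t ub = static_cast<std::uint8_t>(b);
            if (maxViaSaturation(ua, ub) != std::max(ua, ub)) return false;
        }
    }
    return true;
}
```

Because the input space of one lane pair is only 65,536 combinations, the exhaustive check is cheap and removes any doubt about edge cases at 0 and 255.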

The abstract routine generator generates an abstract representation of the code, commonly in the form of a directed acyclic graph, during runtime. This allows the creation of multiple similar routines using a loop inside the image processing code 21 for linear arrays, or the generation of routines on the fly depending on user interaction. For example, the bidirectional MPEG-2 motion compensation can be implemented using a set of sixty-four different but very similar routines that can be generated by a loop in the abstract routine generator. Similarly, an interactive paint program can generate filters or pens in the form of abstract representations based on user input, and can use the routine generator to create efficient code sequences to perform the filtering or drawing operation. Examples of the data types processed by the code sequences include SIMD input data, image input data, and audio input data.
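The idea of producing the sixty-four bidirectional variants from one loop can be sketched as follows. This is an illustration only: RoutineSpec, buildAbstractRoutine, and buildAllBidirectional are hypothetical names, and the abstract routine is modeled as a text expression rather than as a real graph object. The six binary selectors mirror the indices described in Table A.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One combination of the six binary selectors of a bidirectional routine.
struct RoutineSpec {
    bool chroma, delta, half1y, half1x, half2y, half2x;
};

// Builds a textual stand-in for the abstract routine of one combination.
std::string buildAbstractRoutine(const RoutineSpec& s) {
    // Fetch from one source, with optional half pel averaging.
    auto fetch = [](const std::string& src, bool hy, bool hx) -> std::string {
        if (hy && hx)
            return "avg(avg(" + src + ".A," + src + ".B),avg(" + src + ".C," + src + ".D))";
        if (hy) return "avg(" + src + ".A," + src + ".C)";
        if (hx) return "avg(" + src + ".A," + src + ".B)";
        return "load(" + src + ".A)";
    };
    // Bidirectional prediction: average of the two unidirectional fetches.
    std::string g = "avg(" + fetch("src1", s.half1y, s.half1x) + "," +
                    fetch("src2", s.half2y, s.half2x) + ")";
    // Optional error correction; chroma deltas are interleaved by a merge.
    if (s.delta)
        g = "add(" + g + "," + (s.chroma ? "merge(delta)" : "load(delta)") + ")";
    return "store(" + g + ")";
}

// One loop nest generates every variant instead of writing each by hand.
std::vector<std::string> buildAllBidirectional() {
    std::vector<std::string> routines;
    for (int yuv = 0; yuv < 2; ++yuv)
        for (int delta = 0; delta < 2; ++delta)
            for (int half1y = 0; half1y < 2; ++half1y)
                for (int half1x = 0; half1x < 2; ++half1x)
                    for (int half2y = 0; half2y < 2; ++half2y)
                        for (int half2x = 0; half2x < 2; ++half2x)
                            routines.push_back(buildAbstractRoutine(
                                {yuv != 0, delta != 0, half1y != 0,
                                 half1x != 0, half2y != 0, half2x != 0}));
    return routines;
}
```

Six binary selectors give 2^6 = 64 routines, which matches the count stated in the text; adding a seventh selector would double the set without any additional hand-written code.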

Examples of information provided by the graphs include the source blocks, the target blocks, the change in the block, color, stride, change in stride, display block, and spatial filtering.

The accuracy of the operations inside the graphs can be tailored to meet the requirements of the program. The abstract routine generator can increase its precision by increasing the level of arithmetic per pixel. For example, 7-bit processing can be stepped up to 8-bit, or 8-bit to 16-bit. Likewise, motion compensation routines with different types of rounding precision can be generated by the abstract routine generator.
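The effect of arithmetic width on rounding can be illustrated with a scalar model of half pel prediction in both directions. The sketch below is an assumption-laden illustration, not the generator's actual arithmetic: avg8 models a rounding 8-bit average, halfPelXY_8bit nests three such averages (the same shape as the nested average graph built in Table B), and halfPelXY_16bit widens to 16 bits and rounds only once.

```cpp
#include <cassert>
#include <cstdint>

// Rounding average of two 8-bit values, computed per pixel.
std::uint8_t avg8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>((a + b + 1) >> 1);
}

// Lower precision: three nested 8-bit averages. Each stage rounds up, so the
// result can be one step higher than the single-rounding form below.
std::uint8_t halfPelXY_8bit(std::uint8_t a, std::uint8_t b,
                            std::uint8_t c, std::uint8_t d) {
    return avg8(avg8(a, b), avg8(c, d));
}

// Higher precision: widen to 16 bits, sum all four pixels, round once.
std::uint8_t halfPelXY_16bit(std::uint8_t a, std::uint8_t b,
                             std::uint8_t c, std::uint8_t d) {
    std::uint16_t sum = static_cast<std::uint16_t>(a + b + c + d);
    return static_cast<std::uint8_t>((sum + 2) >> 2);
}
```

For the pixel values 1, 0, 0, 0 the nested 8-bit form yields 1 while the 16-bit form yields 0, which shows the kind of rounding difference a program may or may not be willing to accept in exchange for cheaper per-pixel arithmetic.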

The abstract representation, in this case the graph 15, is then sent to the translator 17 where it is translated into optimized assembly code 19. The translator uses standard compiler techniques to translate the generic graph structure into a specific sequence of assembly instructions. Because the description is very generic, there is no link to a specific processor architecture, and because it is very simple, it can be processed without requiring complex compiler techniques. This enables the translation to be executed during program startup without causing a significant delay. Also, the abstract routine generator and the translator do not have to be programmed in assembly. The CPU-specific translator may reside in a dynamic link library and can therefore be replaced if the system processor is changed. This enables programs to use the multimedia instructions of a new processor without the need to be changed.
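The division of labor between a generic graph and a replaceable CPU-specific translator can be sketched in a few lines of C++. Everything in this sketch is hypothetical: the Node and Translator types, the virtual registers, and the pseudo-mnemonics (load, avg, max, subs, adds, store) are invented for illustration and are not the BVP classes of Tables A and B or real processor instructions. The virtual interface stands in for the dynamic link library boundary.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Op { Load, Avg, Max, Store };

struct Node;
using NodePtr = std::shared_ptr<Node>;

struct Node {
    Op op;
    std::string block;            // block name, used by Load and Store only
    std::vector<NodePtr> inputs;  // shared pointers let one node feed several users
};

NodePtr makeNode(Op op, std::string block, std::vector<NodePtr> inputs) {
    return std::make_shared<Node>(Node{op, std::move(block), std::move(inputs)});
}

// Generic part: walks the graph in post-order, emitting each node once.
class Translator {
public:
    virtual ~Translator() = default;
    std::vector<std::string> translate(const NodePtr& root) {
        code_.clear();
        regOf_.clear();
        emit(root);
        return code_;
    }
protected:
    virtual void emitMax(int dst, int a, int b) = 0;  // CPU-specific part
    void line(const std::string& text) { code_.push_back(text); }
    static std::string reg(int r) { return "r" + std::to_string(r); }
private:
    int emit(const NodePtr& n) {
        assert(n && "graph nodes must not be null");
        auto found = regOf_.find(n.get());
        if (found != regOf_.end()) return found->second;  // shared node: reuse result
        std::vector<int> in;
        for (const NodePtr& input : n->inputs) in.push_back(emit(input));
        int dst = static_cast<int>(regOf_.size());        // next free virtual register
        switch (n->op) {
            case Op::Load:  line("load " + reg(dst) + ", [" + n->block + "]"); break;
            case Op::Avg:   line("avg " + reg(dst) + ", " + reg(in[0]) + ", " + reg(in[1])); break;
            case Op::Max:   emitMax(dst, in[0], in[1]); break;
            case Op::Store: line("store [" + n->block + "], " + reg(in[0])); break;
        }
        regOf_[n.get()] = dst;
        return dst;
    }
    std::vector<std::string> code_;
    std::map<const Node*, int> regOf_;
};

// Backend for a command set with a native maximum operation.
class NativeMaxTranslator : public Translator {
    void emitMax(int dst, int a, int b) override {
        line("max " + reg(dst) + ", " + reg(a) + ", " + reg(b));
    }
};

// Backend for a command set without one: saturating subtract, then add.
class SaturatingTranslator : public Translator {
    void emitMax(int dst, int a, int b) override {
        line("subs " + reg(dst) + ", " + reg(a) + ", " + reg(b));
        line("adds " + reg(dst) + ", " + reg(dst) + ", " + reg(b));
    }
};

// Example graph: store(max(avg(srcA, srcB), srcA)), where srcA is shared.
NodePtr buildExample() {
    NodePtr a = makeNode(Op::Load, "srcA", {});
    NodePtr b = makeNode(Op::Load, "srcB", {});
    NodePtr avg = makeNode(Op::Avg, "", {a, b});
    NodePtr mx = makeNode(Op::Max, "", {avg, a});
    return makeNode(Op::Store, "dst", {mx});
}
```

The same graph yields five pseudo-instructions with the first backend and six with the second, and the program that built the graph does not change when the backend is swapped, which is the property the text attributes to a replaceable translator.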

Tables A-C provide sample code that generates an abstract representation of a motion compensation routine, which can be translated into an executable code sequence using the invention.

TABLE A

#ifndef MPEG2MOTIONCOMPENSATION_H
#define MPEG2MOTIONCOMPENSATION_H

#include "driver\softwarecinemaster\common\prelude.h"
#include "..\..\BlockVideoProcessor\BVPXMMXCodeConverter.h"

//
// Basic block motion compensation functions
//
class MPEG2MotionCompensation {
protected:
  //
  // Function prototype for a unidirectional motion compensation routine
  //
  typedef void (_stdcall * CompensationCodeType)(BYTE * source1Base, int sourceStride,
                                                 BYTE * targetBase, short * deltaBase, int deltaStride,
                                                 int num);

  //
  // Function prototype for a bidirectional motion compensation routine
  //
  typedef void (_stdcall * BiCompensationCodeType)(BYTE * source1Base, BYTE * source2Base, int sourceStride,
                                                   BYTE * targetBase, short * deltaBase, int deltaStride,
                                                   int num);

  //
  // Motion compensation routines for unidirectional prediction. Each routine
  // handles one case. The indices are
  // - y-uv  : if it is luma data the index is 0, otherwise 1
  // - delta : error correction data is present (e.g. the block is not skipped)
  // - halfy : half pel prediction is to be performed in vertical direction
  // - halfx : half pel prediction is to be performed in horizontal direction
  //
  CompensationCodeType compensation[2][2][2][2];        // y-uv delta halfy halfx
  BVPCodeBlock *       compensationBlock[2][2][2][2];

  //
  // Motion compensation routines for bidirectional prediction. Each routine
  // handles one case. The indices contain the same parameters as in the
  // unidirectional case, plus the half pel selectors for the second source
  //
  BiCompensationCodeType bicompensation[2][2][2][2][2][2];      // y-uv delta half1y half1x half2y half2x
  BVPCodeBlock *         bicompensationBlock[2][2][2][2][2][2];

public:
  //
  // Perform a unidirectional compensation
  //
  void MotionCompensation(BYTE * sourcep, int stride, BYTE * destp, short * deltap, int dstride, int num,
                          bool uv, bool delta, int halfx, int halfy) {
    compensation[uv][delta][halfy][halfx](sourcep, stride, destp, deltap, dstride, num);
  }

  //
  // Perform bidirectional compensation
  //
  void BiMotionCompensation(BYTE * source1p, BYTE * source2p, int stride, BYTE * destp, short * deltap,
                            int dstride, int num, bool uv, bool delta,
                            int half1x, int half1y, int half2x, int half2y) {
    bicompensation[uv][delta][half1y][half1x][half2y][half2x](source1p, source2p, stride, destp, deltap, dstride, num);
  }

  MPEG2MotionCompensation(void);
  ~MPEG2MotionCompensation(void);
};

#endif

TABLE B

#include "MPEG2MotionCompensation.h"
#include "../../BlockVideoProcessor/BVPXMMXCodeConverter.h"

//
// Create the dataflow to fetch a data element from a source block,
// with or without half pel compensation in horizontal and/or
// vertical direction.
//
BVPDataSourceInstruction * BuildBlockMerge(BVPSourceBlock * source1BlockA, BVPSourceBlock * source1BlockB,
                                           BVPSourceBlock * source1BlockC, BVPSourceBlock * source1BlockD,
                                           int halfx, int halfy) {
  if (halfy) {
    if (halfx) {
      //
      // Half pel prediction in h and v direction: the graph averages
      // the average of A and B with the average of C and D
      //
      return new BVPDataOperation(
        BVPDO_AVG,
        new BVPDataOperation(BVPDO_AVG, new BVPDataLoad(source1BlockA), new BVPDataLoad(source1BlockB)),
        new BVPDataOperation(BVPDO_AVG, new BVPDataLoad(source1BlockC), new BVPDataLoad(source1BlockD)));
    } else {
      //
      // Half pel prediction in v direction only: average of A and C
      //
      return new BVPDataOperation(BVPDO_AVG, new BVPDataLoad(source1BlockA), new BVPDataLoad(source1BlockC));
    }
  } else {
    if (halfx) {
      //
      // Half pel prediction in h direction only: average of A and B
      //
      return new BVPDataOperation(BVPDO_AVG, new BVPDataLoad(source1BlockA), new BVPDataLoad(source1BlockB));
    } else {
      //
      // Full pel prediction
      //
      // <-- (LOAD source1BlockA)
      //
      return new BVPDataLoad(source1BlockA);
    }
  }
}

MPEG2MotionCompensation::MPEG2MotionCompensation(void) {
  int yuv, delta, halfy, halfx, half1y, half1x, half2y, half2x;
  BVPBlockProcessor * bvp;
  BVPCodeBlock * code;
  BVPArgument * source1Base;
  BVPArgument * source2Base;
  BVPArgument * sourceStride;
  BVPArgument * targetBase;
  BVPArgument * deltaBase;
  BVPArgument * deltaStride;
  BVPArgument * height;
  BVPSourceBlock * source1BlockA;
  BVPSourceBlock * source1BlockB;
  BVPSourceBlock * source1BlockC;
  BVPSourceBlock * source1BlockD;
  BVPSourceBlock * source2BlockA;
  BVPSourceBlock * source2BlockB;
  BVPSourceBlock * source2BlockC;
  BVPSourceBlock * source2BlockD;
  BVPSourceBlock * deltaBlock;
  BVPTargetBlock * targetBlock;
  BVPDataSourceInstruction * postMC;
  BVPDataSourceInstruction * postCorrect;
  BVPDataSourceInstruction * deltaData;

  //
  // Build unidirectional motion compensation routines
  //
  for (yuv = 0; yuv < 2; yuv++) {
    for (delta = 0; delta < 2; delta++) {
      for (halfy = 0; halfy < 2; halfy++) {
        for (halfx = 0; halfx < 2; halfx++) {
          bvp = new BVPBlockProcessor();
          bvp->AddArgument(height       = new BVPArgument(false));
          bvp->AddArgument(deltaStride  = new BVPArgument(false));
          bvp->AddArgument(deltaBase    = new BVPArgument(true));
          bvp->AddArgument(targetBase   = new BVPArgument(true));
          bvp->AddArgument(sourceStride = new BVPArgument(false));
          bvp->AddArgument(source1Base  = new BVPArgument(true));
          //
          // Width is always sixteen pixels, so one vector of sixteen unsigned eight bit elements,
          // height may vary, therefore it is an argument
          //
          bvp->SetDimension(1, height);
          //
          // Four potential source blocks, B is one pel to the right, C one down and D right and down
          //
          bvp->AddSourceBlock(source1BlockA = new BVPSourceBlock(source1Base, sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
          bvp->AddSourceBlock(source1BlockB = new BVPSourceBlock(BVPPointer(source1Base, 1 + yuv), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
          bvp->AddSourceBlock(source1BlockC = new BVPSourceBlock(BVPPointer(source1Base, sourceStride, 1, 0), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
          bvp->AddSourceBlock(source1BlockD = new BVPSourceBlock(BVPPointer(source1Base, sourceStride, 1, 1 + yuv), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
          //
          // If we have error correction data, we need this source block as well
          //
          if (delta) bvp->AddSourceBlock(deltaBlock = new BVPSourceBlock(deltaBase, deltaStride, BVPDataFormat(BVPDT_S16, 16), 0x10000));
          //
          // The target block to write the data into
          //
          bvp->AddTargetBlock(targetBlock = new BVPTargetBlock(targetBase, sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
          //
          // Load a source block based on the half pel settings
          //
          bvp->AddInstruction(postMC = BuildBlockMerge(source1BlockA, source1BlockB, source1BlockC, source1BlockD, halfx, halfy));
          if (delta) {
            deltaData = new BVPDataLoad(deltaBlock);
            if (yuv) {
              //
              // It is chroma data and we have error correction data. The u and v
              // parts have to be interleaved, therefore we need the merge instruction.
              // The motion result is converted to S16, the merged delta is added,
              // and the sum is converted back to U8.
              //
              bvp->AddInstruction(
                postCorrect = new BVPDataConvert(
                  BVPDT_U8,
                  new BVPDataOperation(
                    BVPDO_ADD,
                    new BVPDataConvert(BVPDT_S16, postMC),
                    new BVPDataMerge(
                      BVPDM_ODDEVEN,
                      new BVPDataSplit(BVPDS_HEAD, deltaData),
                      new BVPDataSplit(BVPDS_TAIL, deltaData)))));
            } else {
              //
              // It is luma data with error correction. The motion result is
              // converted to S16, the loaded delta is added, and the sum is
              // converted back to U8.
              //
              bvp->AddInstruction(
                postCorrect = new BVPDataConvert(
                  BVPDT_U8,
                  new BVPDataOperation(BVPDO_ADD, new BVPDataConvert(BVPDT_S16, postMC), deltaData)));
            }
            //
            // Store into the target block
            //
            // (STORE targetBlock) <-- ...
            //
            bvp->AddInstruction(new BVPDataStore(targetBlock, postCorrect));
          } else {
            //
            // No error correction data, so store motion result into target block
            //
            // (STORE targetBlock) <-- ...
            //
            bvp->AddInstruction(new BVPDataStore(targetBlock, postMC));
          }
          BVPXMMXCodeConverter conv;
          //
          // Convert graph into machine language
          //
          compensationBlock[yuv][delta][halfy][halfx] = code = conv.Convert(bvp);
          //
          // Get function entry pointer
          //
          compensation[yuv][delta][halfy][halfx] = (CompensationCodeType)(code->GetCodeAddress());
          //
          // delete graph
          //
          delete bvp;
        }
      }
    }
  }

  //
  // build motion compensation routines for bidirectional prediction
  //
  for (yuv = 0; yuv < 2; yuv++) {
    for (delta = 0; delta < 2; delta++) {
      for (half1y = 0; half1y < 2; half1y++) {
        for (half1x = 0; half1x < 2; half1x++) {
          for (half2y = 0; half2y < 2; half2y++) {
            for (half2x = 0; half2x < 2; half2x++) {
              bvp = new BVPBlockProcessor();
              bvp->AddArgument(height       = new BVPArgument(false));
              bvp->AddArgument(deltaStride  = new BVPArgument(false));
              bvp->AddArgument(deltaBase    = new BVPArgument(true));
              bvp->AddArgument(targetBase   = new BVPArgument(true));
              bvp->AddArgument(sourceStride = new BVPArgument(false));
              bvp->AddArgument(source2Base  = new BVPArgument(true));
              bvp->AddArgument(source1Base  = new BVPArgument(true));
              bvp->SetDimension(1, height);
              //
              // We now have two source blocks, so we need eight blocks for the half pel
              // prediction
              //
              bvp->AddSourceBlock(source1BlockA = new BVPSourceBlock(source1Base, sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              bvp->AddSourceBlock(source1BlockB = new BVPSourceBlock(BVPPointer(source1Base, 1 + yuv), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              bvp->AddSourceBlock(source1BlockC = new BVPSourceBlock(BVPPointer(source1Base, sourceStride, 1, 0), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              bvp->AddSourceBlock(source1BlockD = new BVPSourceBlock(BVPPointer(source1Base, sourceStride, 1, 1 + yuv), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              bvp->AddSourceBlock(source2BlockA = new BVPSourceBlock(source2Base, sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              bvp->AddSourceBlock(source2BlockB = new BVPSourceBlock(BVPPointer(source2Base, 1 + yuv), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              bvp->AddSourceBlock(source2BlockC = new BVPSourceBlock(BVPPointer(source2Base, sourceStride, 1, 0), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              bvp->AddSourceBlock(source2BlockD = new BVPSourceBlock(BVPPointer(source2Base, sourceStride, 1, 1 + yuv), sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              if (delta) bvp->AddSourceBlock(deltaBlock = new BVPSourceBlock(deltaBase, deltaStride, BVPDataFormat(BVPDT_S16, 16), 0x10000));
              bvp->AddTargetBlock(targetBlock = new BVPTargetBlock(targetBase, sourceStride, BVPDataFormat(BVPDT_U8, 16), 0x10000));
              //
              // Build bidirectional prediction from two unidirectional predictions
              //

              bvp->AddInstruction(
                postMC = new BVPDataOperation(
                  BVPDO_AVG,
                  BuildBlockMerge(source1BlockA, source1BlockB, source1BlockC, source1BlockD, half1x, half1y),
                  BuildBlockMerge(source2BlockA, source2BlockB, source2BlockC, source2BlockD, half2x, half2y)));
              //
              // Apply error correction, see unidirectional case
              //
              if (delta) {
                deltaData = new BVPDataLoad(deltaBlock);
                if (yuv) {
                  bvp->AddInstruction(
                    postCorrect = new BVPDataConvert(
                      BVPDT_U8,
                      new BVPDataOperation(
                        BVPDO_ADD,
                        new BVPDataConvert(BVPDT_S16, postMC),
                        new BVPDataMerge(
                          BVPDM_ODDEVEN,
                          new BVPDataSplit(BVPDS_HEAD, deltaData),
                          new BVPDataSplit(BVPDS_TAIL, deltaData)))));
                } else {
                  bvp->AddInstruction(
                    postCorrect = new BVPDataConvert(
                      BVPDT_U8,
                      new BVPDataOperation(BVPDO_ADD, new BVPDataConvert(BVPDT_S16, postMC), deltaData)));
                }
                bvp->AddInstruction(new BVPDataStore(targetBlock, postCorrect));
              } else {
                bvp->AddInstruction(new BVPDataStore(targetBlock, postMC));
              }
              BVPXMMXCodeConverter conv;
              //
              // Translate routines
              //
              bicompensationBlock[yuv][delta][half1y][half1x][half2y][half2x] = code = conv.Convert(bvp);
              bicompensation[yuv][delta][half1y][half1x][half2y][half2x] = (BiCompensationCodeType)(code->GetCodeAddress());
              delete bvp;
            }
          }
        }
      }
    }
  }
}

MPEG2MotionCompensation::~MPEG2MotionCompensation(void) {
  int yuv, delta, halfy, halfx, half1y, half1x, half2y, half2x;

  //
  // free all motion compensation routines
  //
  for (yuv = 0; yuv < 2; yuv++) {
    for (delta = 0; delta < 2; delta++) {
      for (halfy = 0; halfy < 2; halfy++) {
        for (halfx = 0; halfx < 2; halfx++) {
          delete compensationBlock[yuv][delta][halfy][halfx];
        }
      }
    }
  }

  for (yuv = 0; yuv < 2; yuv++) {
    for (delta = 0; delta < 2; delta++) {
      for (half1y = 0; half1y < 2; half1y++) {
        for (half1x = 0; half1x < 2; half1x++) {
          for (half2y = 0; half2y < 2; half2y++) {
            for (half2x = 0; half2x < 2; half2x++) {
              delete bicompensationBlock[yuv][delta][half1y][half1x][half2y][half2x];
            }
          }
        }
      }
    }
  }
}

Table C

Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the claims included below. 

1. An apparatus for generating processor-specific assembly code dynamically, comprising: an abstract routine generator within a program executing on a computer, said abstract routine generator for receiving a data stream comprising a multimedia routine and for outputting a generic abstract representation thereof at program startup; and a translator for said abstract routine generator within said program for receiving said abstract representation and for outputting processor specific final code for processing multimedia input data at program startup.
 2. The apparatus of claim 1, wherein said abstract routine generator builds an abstract routine during program runtime.
 3. The apparatus of claim 1, wherein said abstract routine generator builds an abstract routine in the form of a graph.
 4. The apparatus of claim 1, wherein said multimedia data comprise SIMD input data.
 5. The apparatus of claim 1, wherein said multimedia data comprise image input data.
 6. The apparatus of claim 1, wherein said multimedia data comprise audio input data.
 7. The apparatus of claim 3, wherein said graph is input to said translator.
 8. The apparatus of claim 3, wherein the output of said translator is in assembly code.
 9. The apparatus of claim 1, wherein said translator's configuration can be changed by use of a dynamic library link.
 10. The apparatus of claim 1, wherein said processor-specific code performs any of the operations of add, sub, multiply, average, maximum, minimum, compare, and, or, xor, pack, unpack, and merge on said input data.
 11. The apparatus of claim 3, wherein said graph is a function of any of source block, target block, change in the block, color, stride, change in stride, display block, and spatial filtering.
 12. A method for generating processor-specific assembly code dynamically, comprising: providing a computer executing a program including an abstract routine generator for receiving a data stream comprising a multimedia routine and for generating a generic abstract representation thereof, at program startup; and said program including a translator for receiving said abstract representation from said abstract routine generator; and outputting processor-specific final code for processing multimedia input data at program startup.
 13. The method of claim 12, wherein said abstract routine generator builds the abstract routine during program runtime.
 14. The method of claim 13, wherein said abstract routine is a graph.
 15. The method of claim 12, wherein said multimedia input data comprise SIMD data.
 16. The method of claim 12, wherein said multimedia input data comprise image data.
 17. The method of claim 12, wherein said multimedia input data comprise audio data.
 18. The method of claim 14, wherein said graph is input to said translator.
 19. The method of claim 12, wherein the output of said translator is assembly code.
 20. The method of claim 12, wherein said processor-specific code performs any of the operations of add, sub, multiply, average, maximum, minimum, compare, and, or, xor, pack, unpack, and merge on said multimedia input data.
 21. The method of claim 14, wherein said graph is a function of any of source block, target block, change in the block, color, stride, change in stride, display block, and spatial filtering.
 22. The method of claim 12, wherein said translator can be changed by use of a dynamic library link. 